A new gene tree algorithm employing DNA sequences of bovine genome using discrete Fourier transformation

Within the realms of human thoughts on nature, Fourier analysis is considered as one of the greatest ideas currently put forwarded. The Fourier transform shows that any periodic function can be rewritten as the sum of sinusoidal functions. Having a Fourier transform view on real-world problems like the DNA sequence of genes, would make things intuitively simple to understand in comparison with their initial formal domain view. In this study we used discrete Fourier transform (DFT) on DNA sequences of a set of genes in the bovine genome known to govern milk production, in order to develop a new gene clustering algorithm. The implementation of this algorithm is very user-friendly and requires only simple routine mathematical operations. By transforming the configuration of gene sequences into frequency domain, we sought to elucidate important features and reveal hidden gene properties. This is biologically appealing since no information is lost via this transformation and we are therefore not reducing the number of degrees of freedom. The results from different clustering methods were integrated using evidence accumulation algorithms to provide in insilico validation of our results. We propose using candidate gene sequences accompanied by other genes of biologically unknown function. These will then be assigned some degree of relevant annotation by using our proposed algorithm. Current knowledge in biological gene clustering investigation is also lacking, and so DFT-based methods will help shine a light on use of these algorithms for biological insight.


Introduction
Periodic patterns can be tracked across biological sequences. By locating and characterizing DNA patterns, some biologically motivated information can be extracted about the structure and function of the DNA sequence [1,2]. To analyze DNA sequence it should first be converted into numerical sequence. Different mapping functions have been defined for this purpose [3].The simplest one is binary mapping that should bear up biological implication like electron-ion-interaction pseudo-potential (EIIP) mapping rule [4]. A new binary mapping a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 function based on weak-strong hydrogen bonding has also been defined [5]. The paper by Arniker and Kwan (2012) thoroughly covers most of the main mapping functions. From a digital signal processing (DSP) point of view, a DNA sequence is a one-dimensional signal and it is therefore possible to apply many signal processing algorithms to explore DNA sequence data [6]. When a new DNA sequence with unknown function is identified, generally researchers quest for the most 'similar' or 'identical' DNA sequence in a well-annotated biological dataset. This is the accepted routine, since identical DNA sequences tend to reflect identical biological functionality and forms the basis for determining whether the DNA sequences are homologous, e.g., there is shared ancestry between them. DSP based methods have been used to locate reading frames in DNA, including different gene regions (exons) [7], to detect splice sites within the gene [8], to identify active sites in a protein, to identify acceptor splicing sites and motif patterns in DNA [9,10]. DFT is the most popular DSP technique [7]. In general, the fast Fourier transform was developed to compute the DFT. In fact, DFT has been used to analyze DNA sequence data [8]. Discrete Fourier Transform (DFT) is a powerful tool in many practical applications, such as analysis of biological events. For example, DFT has been used to develop a measure of DNA sequence similarity [11], to quantify the physical behavior of biological systems [12], as well as to cluster and predict gene expression [13][14][15]. Recently, Hoang et al. (2015) proposed a new Fourier power spectrum based method to cluster DNA sequences in which sequences of different length could be easily compared. The spectrum estimation is mainly performed in random processes in which random variables cannot be thoroughly explored by means of deterministic tools. In this study we used DFT to cluster the major bovine milk genes. The nature of this method can be applied to data consisting of any sized DNA. However, Fourier analysis that is well-adapted to the study of biological data can be used instead of conventional modeling techniques. It would seem that its application and utility shall increase as more and more high-throughput data become available.

A simple illustration of DFT of DNA sequence
The Fourier transform (FT) of a continuous-time function x(t) can be shown as follows [8]: . Since the Fourier transform is an invertible function, the inverse of FT is given by: The Discrete Fourier Transform (DFT) of x [n] with period N at the point of k is defined as below [8]: Also, the inverse of DFT is given by: Fundamental to both continuous and discrete Fourier transform are the following equations: Now suppose we have the following gene: G = {a,g,c,t,t,t,a}. This postulated gene, has 7 nucleotides (bases). Here, the conversion of DNA sequence to a numerical sequence is necessary. We can define an arbitrary and randomly value-assignment scheme such as: 5 to A, 6 to C, 7 to G and 8 to T. However, it would be better to write down a value-assignment scheme that converts a DNA sequence into four binary sequences. This scheme, we call it G−scheme, generates x[n] sequence in which i−th item of this sequence, x i [n], can be computed by: Therefore, for the above example, we consider that N = 7, then, we could have the following binary indicator sequences: By computing the DFT of the w n , we have: W½k� ¼ f2; 1:623 þ 0:782j; 0:777 þ 0:975j; 0:099 þ 0:434j; 0:099À 0:434j; 0:777À 0:975j; 1:623À 0:782jg.
For each DNA bases, we can write down a spectrum and DFT can be applied. In the above, just G[a] = {1, 0, 0, 1} is shown as DFT. In this study, we use DFT to handle the DNA sequences of a set of candidate genes governing milk production in cattle. To the best of our knowledge, there have been no previous studies in this area.

Data extraction and conversion
In the study, we randomly selected 30 genes involved in bovine milk production [16]. The DNA sequences of these genes were obtained from the NCBI genome database (http://www. ncbi.nlm.nih.gov/genbank/gene) and downloaded as FASTA format. To extract all gene features (e.g. exon number, exon length, gen length, etc) automatically, we developed a standalone program using C Sharp programming language. This simple program facilitated retrieval of gene features from stored DNA sequence data. This software is available from authors upon request. In this study 30 genes that are functioning in milk production (based upon 6 classifications), were randomly selected.

Gene clustering by means of DFT
After converting gene sequences into respective binary indicator sequences using binary mapping, they were discrete Fourier transformed. In this way, we used the idea that was proposed by [17]. Simply, we first tried to find DFT of an indicator binary sequence at frequency k (k = 0, . . ., N−1) and later on, the power spectrum of that indicator sequence at frequency k was obtained. For comparing a set of numerical sequences, those sequences should be of the same length in order to for example a metric measure like Euclidian distance could be applicable to find their distance with each other. As Table 1 in this paper shows, the length of applied genes, therefore, their respective binary indicator sequences were quite different. To tackle with this issue, we used new mathematical moments e.g. Power Spectrum Moments (PSM) method by [17]. The result obtained by PSM was used as an entry into four main agglomerative hierarchical clustering algorithms e.g. nearest distance (single linkage method), furthest distance (complete linkage method), un-weighted pair group method average (UPGMA, group average), weighted pair group method centroid (WPGMC) and gene trees were obtained. In order to combine the results of these four gene trees into single one (sort of ensemble clustering), we applied a so called evidence accumulation [18]. This was done to tackle with sensitivity with distance measure as it was hard to find the proper one, possible noise and outlier and cluster type, shape and density of single algorithm. In this way, we converted clustering results of an algorithm into binary distance matrices and evidence accumulation was obtained from collective of the binary distance matrices and finally a hierarchical clustering algorithm was used to learn the ensemble-like cluster. At the end, the GeneMANIA prediction tool was used to check the evidence accumulation results with regard biological function [19]. To do so, a gene list of every singleton of final ensemble cluster was obtained and imported to GeneMANIA. This program tries to find those genes that are functionally similar, or have shared properties with the list of imported genes, and displays an interactive functional association network at seven levels: predicted, physical interactions, co-expression, co-localization, pathway, shared protein domains and genetic interactions. All computations were performed using MATLAB, Version 2015.

Results and discussion
Generally, Fourier based analysis needs long series data in conjunction with high sampling density. However, this is too much demanding for majority of standard biochemical experiments. In this way, the length of a time series biological experiment cannot be readily extended. However, DNA data, almost lack of these restrictions and working with them are almost simple. Data heterogeneity is intrinsic to the nature of DNA sequence data. Table 1 highlights this issue. Table 1 was created using a package developed within C# to overcome most of the problems caused by data heterogeneity in a way that downstream analysis is readily possible. The Figs 1 and 2 show some general information about the data used in this study in terms of DFT. In this study, STX3 (79347 bp) and CD14 (1417 bp) genes were the longest and shortest genes respectively (Table 1). Overall exon number for all investigated genes was 258, with exon 1 of HSPA1A (2101 bp) and HSPA5 (20 bp) genes being the longest and shortest exons respectively. Also, LCP1 and ABCG2 genes with 16 exons, and YWHAG and HSPA1A

PLOS ONE
genes with 1 exon contained the highest and lowest number of exons respectively. Table 2 reflects the four gene groups that were extracted from Fig 3. In this sense, it says that 30 genes were grouped in four gene groups. For each group we used GeneMANIA server to get some clue about their relatedness from biological point of view. Exploration of Fig 4 would indicate that the genes were almost soundly grouped solely based on DNA sequences converted to DFT. Fig 5 show the gene tree results after using Average, Complete, Median and Single algorithms. They show high degree topologies in terms of grouping genes that are tightly clustered over these four algorithms. In this way, the gene tree results due to both Median and Complete algorithms show a high degree of similarity for tightly clustered genes. Visually, it does seem that the clustering algorithms have discovered an obvious pattern of gene grouping. However,

PLOS ONE
in general, we are not expecting hierarchical algorithms to define specific clusters, but rather to define the dendrogram. From any dendrogram we can decipher the proximity or distance between any two groups of genes by looking at the clustering height at which the two groups split. To define clusters, we need to "cut the tree" at some distance and group all samples that are within that distance into groups below. In general, we draw a horizontal line at the height of the cluster that we wish to cut and this defines gene groups. We selected an arbitrary ad hoc method for tree cutting that resulted in four gene groups. There is the potential to increase the effective number of gene groups by exploring different ways to select the minimum gene cluster size. It might be better to select the minimum cluster size as a function of the gene dataset size, or perhaps develop a completely separate method for selecting the most appropriate gene group size. In this way, it would be beneficial to make use of a hybrid tree cut [20]. The hybrid tree cut algorithm, similar to dynamic tree cut, creates well shaped clusters with dense cores of tightly lumped nodes and few outliers. It is acknowledged that this algorithm could potentially

PLOS ONE
create clusters in which variables (e.g. genes) are more closely related and thus be more useful for prediction. Gene tree integration by means of evidence accumulation resulted in a gene tree containing four clusters. Fig 5 shows the results. The analyses using GeneMANIA, showed that the gene clusters each shared around 43%, 68%, 52% and 68% co-expression with other genes. All genes within each cluster were associated to the same metabolic pathways. It was also observed that gene clusters functioned differently in terms of being involved in different metabolic pathways. These results could support our gene DNA sequence DFT-based clustering. However, for further explorations of the results, we applied more agglomerative clustering algorithms e.g. complete, median and single to learn the DFT cluster (Fig 5).  Table 3 show the final result of combining the four clustering algorithms. For this set of genes, we also tried, in another study, to come up with a gene tree using information theory formalism. In this way, for each gene, the Shannon entropy was calculated in orders one to

PLOS ONE
four miming the Markov chain up to order three orders and, for each gene, the relative Shannon entropy or Kullback-Leibler divergence was calculated. After obtaining the Kullback-Leibler distance for the genes, the results were entered into seven clustering algorithms: single, complete, average, weighted, centroid, median, and k-means. Based on metabolic pathway annotations we saw that the proposed clustering method yielded logical and fast results. This method also doesn't have the disadvantages of aligning allowed the genes with actual length and content to be considered and also didn't require high compute memory for long sequences [21]. Different orders of entropies were used to derive information-based gene clustering. These authors used the same in silico method of results confirmation as we did in this study PLOS ONE [22]. In silico confirmation of biological function has found its way in gene association studies, nucleotide polymorphism detection, differential gene expression analysis and novel gene prediction [23][24][25]. However, there are some discrepancies between the Shannon entropy results and current results, though, with both entropy-based and DFT-based methods, we got logical results only from gene sequence data clustering. The mathematical formalism of Fourier transform has recently been used in biology, mostly in gene expression clustering [14,17,26] rather than with individual gene sequences [17]. Lack of thorough biological confirmation, even at the in silico level, would remain a big caveat of these explorations. In this study STX3, ABCG2 and GNB1 cluster together (Fig 6). If we look at the biological knowledge relating to each individual gene, we can see their different molecular functions (https://www.genecards.org/).   STX3 is a member of the syntaxin family and is involved in Sertoli-Sertoli cell junction dynamics and synaptic vesicle cycle pathways. ABCG2 is included in the superfamily of ATP-binding cassette (ABC) transporters. GNB1 encodes a beta subunit of guanine nucleotide binding proteins and is an important regulator of the alpha subunits that constitute these proteins. It is associated with signal transduction receptors and effectors and alternative splicing. We believe that by jointly clustering time course gene expression data with their gene DNA sequences will elucidate how these genes function in a way that explains why they are seen to cluster together. Therefore, there are some limitation to both DNA based or time course gene expression individual clustering. For example, fluctuations or changes in gene expression may not necessarily be an indicator of changes in protein levels, since there are so many modifiers involved e.g. post-transcriptional regulation (microRNA mediated regulation). We propose that combing the gene entropy based method and time course gene expression data with the gene DFT method could be an interesting proposal for future research. The simple DFT method applied here can be run easily with the Fast Fourier Transform Algorithm (FFT). FFT is the workhorse of the DFT method. This algorithm can be trained to work readily on these combined levels of data.

Conclusions
To best of our knowledge, we are the first to use a DFT-based algorithm to group bovine genes. One possibility worth exploring which would surely influence the effectiveness of the proposed algorithm is the selection of distance measure. As a matter of fact, the distance matrix is very important in the creation of a dendrogram and thus the resulting gene clusters. Selecting the best possible biologically-motivated distance matrix has an important impact on the results. In this study we used Euclidian distance matrix to build up the DFT-based gene trees. A future study would be to investigate the accuracy of the DFT-based gene clustering algorithm with different distance measures. The gene cut size is important. There were some minor discrepancies in the performance of the learned gene tree results, however, we would expect this when the number of genes increases. Gene tree results will most likely be

PLOS ONE
inconsistent over many large gene datasets. This might imply that there is no unique optimal minimum gene cluster size for different large datasets. This matter would certainly be worth investigating to test the validity of this notion, and to determine a way of selecting this optimal minimum cluster gene size.